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Abstract 



Probability mass curves the data space with horizons!. Let / be a mul- 
tivariate probability density function with continuous second order partial 
derivatives. Consider the problem of estimating the true value of f(z) > 
at a single point z, from n independent observations. It is shown that, 
the fastest possible estimators (like the k-nearest neighbor and kernel) have 
minimum asymptotic mean square errors when the space of observations is 
thought as conformally curved. The optimal metric is shown to be generated 
by the Hessian of / in the regions where the Hessian is definite. Thus, the 
peaks and valleys of / are surrounded by singular horizons when the Hessian 
changes signature from Riemannian to pseudo-Riemannian. Adaptive esti- 
mators based on the optimal variable metric show considerable theoretical 
and practical improvements over traditional methods. The formulas simplify 
dramatically when the dimension of the data space is 4. The similarities 
with General Relativity are striking but possibly illusory at this point. How- 
ever, these results suggest that nonparametric density estimation may have 
something new to say about current physical theory. 
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1 Introduction 

During the past thirty years the theory of Nonparametrics has been domi- 
nating the scene in mathematical statistics. Parallel to the accelerating dis- 
covery of new technical results, a consensus has been growing among some 
researchers in the area, that we may be witnessing a promising solid road 
towards the elusive Universal Learning Machine (see e.g. 0,0). 

The queen of nonparametrics is density estimation. All the fundamen- 
tal ideas for solving the new problems of statistical estimation in functional 
spaces (smoothing, generalization, optimal minimax rates, etc.) already ap- 
pear in the problem of estimating the probability density (i.e. the model) 
from the observed data. More over, it is now well known that a solution 
for the density estimation problem automatically implies solutions for the 
problems of pattern recognition and nonparametric regression as well as for 
most problems that can be expressed as a functional of the density. 

In this paper I present a technical result, about optimal nonparametric 
density estimation, that shows at least at a formal level, a surprising simi- 
larity between nonparametrics and General Relativity. Simply put, 

probability mass curves the data space with horizons. 

What exactly it is meant by this is the subject of this paper but before 
proceeding further a few comments are in order. First of all, let us assume 
that we have a set {xi, . . . ,x n } of data. Each observation Xj consisting of 
p measurements that are thought as the p coordinates of a vector in H p . 
To make the data space into a probability space we endow IR P with the 
field of Borelians but nothing beyond that. In particular no a priori metric 
structure on the data space is assumed. The n observations are assumed to 
be n independent realizations of a given probability measure P on IR P . By 
the Lebesgue decomposition theorem, for every Borel set B we can write, 

P(B)= [ f(x)X(dx)+u(B) (1) 
Jb 

where v is the singular part of P that assigns positive probability mass to 
Borel sets of zero Lebesgue volume. Due to the existence of pathologies like 
the Cantor set in one dimension and its analogies in higher dimensions, the 
singular part v cannot be empirically estimated (see e.g. ||). Practically all 
of the papers on density estimation rule out the singular part of P a priori. 
The elimination of singularities by fiat has permitted the construction of a 
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rich mathematical theory for density estimation, but it has also ruled out a 
priori the study of models of mixed dimensionality (see |4j]) that may be nec- 
essary for understanding point masses and spacetime singularities coexisting 
with absolutely continuous distributions. 

We assume further that in the regions where f(x) > the density / is of 
class C 2 i.e., it has continuous second order partial derivatives. 

1.1 Nonparametrics with the World in Mind 

The road from Classical Newtonian Physics to the physics of today can be 
seen as a path paved by an increasing use of fundamental concepts that 
are statistical in nature. This is obvious for statistical mechanics, becoming 
clearer for quantum theory, and appearing almost as a shock in General 
Relativity. Not surprisingly there have been several attempts to take this 
trend further (see e.g. 0, || [7|, ||) in the direction of Physics as Inference. 

Now suppose for a moment that in fact some kind of restatement of the 
foundations of physics in terms of information and statistical inference will 
eventually end up providing a way to advance our current understanding of 
nature. As of today, that is either already a solid fact or remains a wild spec- 
ulation, depending on who you ask. In any case, for the trend to take over, it 
will have to be able to reproduce all the successes of current science and make 
new correct predictions. In particular it would have to reproduce General 
Relativity. Recall that the main lesson of General Relativity is that space 
and time are not just a passive stage on top of which the universe evolves. 
General Relativity is the theory that tells (through the field equation) how to 
build the stage (left hand side of the equation) from the data (right hand side 
of the equation). The statistical theory that tells how to build the stage of 
inference (the probabilistic model) from the observed data is: Nonparametric 
Density Estimation. It is therefore reassuring to find typical signatures of 
General Relativity in density estimation as this paper does. Perhaps Physics 
is not just a special case of statistical inference and all these are only coin- 
cidences of no more relevance than for example the fact that multiplication 
or the logarithmic function appear everywhere all the time. That may be 
so, but even in that case I believe it is worth noticing the connection for 
undoubtedly GR and density estimation have a common goal: The dynamic 
building of the stage. 

More formally. Let / be a multivariate probability density function with 
continuous second order partial derivatives. Consider the problem of esti- 
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mating the true value of f(z) > at a single point z, from n independent 
observations. It is shown that, fastest possible estimators (including the k- 
nearest neighbor and kernel as well as the rich class of estimators in || the- 
orems. 1]) have minimum asymptotic mean square errors when the space of 
observations is thought as conformally curved. The optimal metric is shown 
to be generated by the Hessian of / in the regions where the Hessian is defi- 
nite. Thus, the peaks and valleys of / are surrounded by horizons where the 
Hessian changes signature from Riemannian to pseudo-Riemannian. 

The result for the case of generalized k-nearest neighbor estimators || 
has circulated since 1988 in the form of a technical report pi. Recently 



I found that a special case of this theorem has been known since 1972 |1 1 
and undergone continuous development in the Pattern Recognition literature, 
(see e.g. 



12, 13, 14, 15|). 



2 Estimating Densities from Data 

The canonical problem of density estimation at a point z e IR P can be stated 
as follows: Estimate f(z) > from n independent observations of a random 
variable with density f . 

The most successful estimators of f(z) attempt to approximate the den- 
sity of probability at z by using the observations x±, . . . , x n to build both, a 
small volume around z and, a weight for this volume in terms of probability 
mass. The density is then computed as the ratio of the estimated mass over 
the estimated volume. The two classical examples are the k-nearest neighbor 
(knn) and the kernel estimators. 



2.1 The knn 

The simplest and historically the first example of a nonparametric density 
estimator is |Tj| the knn. The knn estimator of f(z) is defined for k 6 
{1,2, . . . ,n} as, 

hn(z) = ^ (2) 

where is the volume of the sphere centered at the point z G IR P of radius 
R(k) given by the distance from z to the kth-nearest neighbor observation. 
If A denotes the Lebesgue measure on 1R P we have, 
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X k = X(S(R(k))) (3) 

where, 

S(r) = {x E W : \\x-z\\ < r} (4) 

The sphere S(r) and the radius R(k) are defined relative to a given norm, 
|| • || in ]R P . The stochastic behavior of the knn depends on the specific value 
of the integer k chosen in (|2|). Clearly, in order to achieve consistency (e.g. 
stochastic convergence of h n (z) as n — > oo towards the true value of f(z) > 0) 
it is necessary to choose k = k(n) as a function of n. The volumes must 
shrink, to control the bias, and consequently we must have k/n — > for h n (z) 
to be able to approach a strictly positive number. On the other hand, we 
must have k — > oo to make the estimator dependent on an increasing number 
k of observations and in this way to control its variance. Thus, for the knn 
to be consistent, we need k to increase with n but at a rate slower than n 
itself. 

The knn estimator depends not only on k but also on a choice of norm. 
The main result of this paper follows from the characterization of the || • || 
that, under some regularity conditions, produces the best asymptotic (as 
n — > oo) performance for density estimators. 



2.2 The kernel 

If we consider only regular norms || • ||, in the sense that for all sufficiently 
small values of r > 0, 

\{S{r)) = \(S(l))r p = Pr p (5) 
then, the classical kernel density estimator can be written as: 

9niz) = WW) (6) 

where, 

M,= l -[ £ K^ixj-z)) (7) 

The smoothing parameter fi = fi(n) is such that k = [nfi p ] satisfies the 
conditions for consistency of the knn, K^-i(x) = K(fi~ 1 x) where the kernel 
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function K is a non negative bounded function with support on the unit 
sphere (i.e. K(x) = for ||sc|| > 1) and satisfying, 

/ K{x)dx = (3 (8) 

J\\x\\<l 

Notice that for the constant kernel (i.e. K(x) = 1 for ||x|| < 1) the estimator 
(Q) approximates f(z) by the proportion of observations inside S(fj,) over the 
volume of S(fi). The general kernel function K acts as a weight function 
allocating different weights K /Ji -i(xj — z) to the x/s inside S(n). To control 
bias (see ( j32|) below) the kernel K is usually taken as a decreasing radially 
symmetric function in the metric generated by the norm || • || . Thus, Jf „-i (xj — 
z) assigns a weight to Xj that decreases with its distance to z. This has 
intuitive appeal, for the observations that lie closer to z are less likely to fall 
off the sphere S(p), under repeated sampling, than the observations that are 
close to the boundary of S(fi). 

The performance of the kernel as an estimator for f(z) depends first and 
foremost on the value of the smoothness parameter /i. The numerator and 
the denominator of g n (z) depend not only on /i but also on the norm || ■ || 
chosen and the form of the kernel function K. As it is shown in theorem (||) 
these three parameters are inter-related. 

2.3 Double Smoothing Estimators 

The knn (0) and the kernel @ methods are two extremes of a continuum. 
Both, h n (z) and g n {z) estimate f(z) as probability-mass-per-unit-volume. 
The knn fixes the mass to the deterministic value k/n and lets the volume 
to be stochastic, while the kernel method fixes the volume \(S(n)) and lets 
the mass M M to be random. The continuum gap between (0) and @ is filled 
up by estimators that stochastically estimate mass and volume by smoothing 
the contribution of each sample point with different smoothing functions for 
the numerator and denominator (see 0). 

Let b > 1 and assume, without loss of generality that bk is an integer. 
The double smoothing estimators with deterministic weights are defined as, 
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where, 

ri/bk 

u)i = / u(u)du (10) 
J{i-i)/bk 

and o>(-) is a probability density on [0, 1] with mean c. 

3 The Truth as n oo ? 

In nonparametric statistics, in order to assess the quality of an estimator 
as an estimate for f(z), it is necessary to choose a criterion for judg- 
ing how far away is the estimator from what it tries to estimate. This is 
sometimes regarded as revolting and morally wrong by some Bayesian Fun- 
damentalists. For once you choose a loss function and a prior, logic alone 
provides you with the Bayes estimator and the criterion for judging its qual- 
ity. That is desirable, but there is a problem in high dimensional spaces. 
In infinite dimensional hypothesis spaces (i.e. in nonparametric problems) 



almost all priors will convince you of the wrong thing! (see e.g. [17, 1 



for a non-regular way out see fll9|). These kind of Bayesian nonparametric 



results provide a mathematical proof that: almost all fundamental religions 
are wrong, (more data can only make the believers more sure that the wrong 
thing is true!). An immediate corollary is that: Subjective Bayesians can't 
go to Heaven. Besides, the choice of goodness of fit criterion is as ad-hoc (an 
equivalent) to the choice of a loss function. 



3.1 The Natural Invariant Loss Function and Why the 
MSE is not that Bad 

The most widely studied goodness of fit criterion is the Mean Square Error 
(MSE) defined by, 

(MSE) =E\f n (z)-f(z)\ 2 (11) 

where the expectation is over the joint distribution of the sample x\, . . . ,x n . 
By adding and subtracting T = Ef n (z) and expanding the square, we can 
express the MSE in the computationally convenient form, 



(MSE) = E\f n {z)-T\ 2 + \T-f{z)\ 2 
= (variance) + (bias) 2 



(12) 
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By integrating ( |TTD over the z G H p and interchanging and / (OK by 
Fubbini's theorem since the integrand > 0) we obtain, 



(MISE) = eJ \f n (z)-f(z)\ 2 dz 



(13) 



The Mean Integrated Square Error (MISE) is just the expected L 2 distance 
of f n from /. Goodness of fit measures based on the (MSE) have two 
main advantages: They are often easy to compute and they enable the rich 
Hilbertian geometry of L 2 . On the other hand the (MSE) is unnatural and 
undesirable for two reasons: Firstly, the (MSE) is only defined for densities 
in L 2 and this rules out a priori all the densities in L 1 \ L 2 which is unaccept- 
able. Secondly, even when the (MISE) exists, it is difficult to interpret (as 
a measure of distance between densities) due to its lack of invariance under 
relabels of the data space. Many researchers see the expected L 1 distance 
between densities as the natural loss function in density estimation. The L l 
distance does in fact exist for all densities and it is easy to interpret but it 
lacks the rich geometry generated by the availability of the inner product in 
L 2 . A clean way out is to use the expected L 2 distance between the wave 
functions ip n = a/7« and ip = y/J. 

Theorem 1 The L 2 norm of wave functions is invariant under relabels of 
the data space, i.e., 



(14) 



\^ n (z) - ^(z)\ 2 dz = J \^ n (z') - ^(z')\ 2 dz' 
where z = h(z') with h any one-to-one smooth function. 



Proof: Just change the variables. From, the change of variables theorem 
the pdf of z' is, 



f(z>) = f(h(z')) 



d(h) 



d(z') 



(15) 



from where the wave function of z' is given by, 



^P(z') = ^(h(z')) 



d(h) 



d(z') 



1/2 



(16) 



Thus, making the substitution z = h(z') we get, 
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\ip n - ip\ 2 dz 



|Vfc(/»CO) - V>(M*'))I 2 



d(h) 



d{h) 



^ d(z>] 

- ljj\ 2 dz' 



1/2 



d{h) 



d{z>) 

1/2 



d(z') 



\ 2 dz' 



(17) 



Q.E.D. 

The following theorem shows that a transformation of the MSE of a con- 
sistent estimator provides an estimate for the expected L 2 norm between 
wave functions. 

Theorem 2 Let f n (z) be a consistent estimator of f(z). Then, 

f If E\fJz) — f(z)\ 2 

E / \ip n — ip\ 2 dz — — I — —— dz + (smaller order terms) (18) 

J 4 J J yZ) 

Proof: A first order Taylor expansion of yfx about Xq gives, 

1 (x — X ) 



X — WXq 



+ 0((x - Xq) ) 



(19) 



2 v /x^ 

Substituting x = f n (z),xo = f(z) into (|T^) squaring both sides and taking 
expectations we obtain, 



£|V„(z)-V(*)| 5 



lE\f n {z)-f(z) 



+ o(E\f n (z)-f(z)f 



4 f(z) 

integrating over z and interchanging E and / we arrive at ( 18 ) 



(20) 



Q.E.D. 

Proceeding as in the proof of theorem |l| we can show that 



l/» - /I s 



dz 



\fn-ff 



dz' 



(21) 



/ J f 

where, as before, z z' is any one-to-one smooth transformation of the data 



space and / is the density of z' . Thus, it follows from ([H]) that the leading 
term on the right hand side of fll8|) is also invariant under relabels of the data 
space. The nice thing about the L 2 norm of wave functions, unlike (^1|), is 
that it is defined even when f(z) =0. 
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4 Some Classic Asymptotic Results 

We collect here the well known Central Limit Theorems (CLT) for the knn 
and kernel estimators together with some remarks about nonparametric den- 
sity estimation in general. The notation and the formulas introduced here 
will be needed for computing the main result about optimal norms in the 
next section. 

Assumption 1 Let f be a pdf on 1R P of class C 2 with non singular Hessian, 
H(z) at z G 1R P ; and with f(z) > 0, i.e., the matrix of second order partial 
derivatives of f at z exists, it is non singular and its entries are continuous 
at z. 

Assumption 2 Let K be a bounded non negative function defined on the 
unit sphere, So = {x G IR P : ||x|| < 1} and satisfying, 

[ K(x)dx = X(Sq) = 13 (22) 

J\\x\\<l 

[ xK(x)dx = G W (23) 

J||x||<l 



Theorem 3 (CLT for knn) Under assumption [I, if k — k(n) is taken in 
the definition of the knn ^) in such a way that for some a > 

lim n- 4/{p+4) fc = a (24) 

n— *oo 

then, if we let Z n = \/k(h n (z) — f(z)) we have, 

2s.K*> * f ) - L^<*» ( Jl ^F) dv (25) 

where, 

B(2) =(A)){ r " 2 "t SI ^ (2) ^} (26) 

and, 



V(z) = f(z) 



(27) 
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Proof: This is a special case of ||, theorem3.1]. 

Theorem 4 (CLT for kernel) Under assumptions^, and^ if // = fi(n) is 
taken in the definition of the kernel (j^) in such a way that for some a > 0, 
k = [n/x p ] satisfies then, if we let Z n = ^/k(g n (z) — f(z)) we have $ib\ ) 
where now, 

B{z) = \ [fi' 1 1 ^ <x xTH { z ) xK { x )dx 

and, 

V(z) = f(z){(3- 2 J M<i K 2 (x)dx} (29) 

Proof: The sample x±, . . . , x n is assumed to be iid / and therefore the kernel 
estimator g n (z) given by (^j) and (|7|) is a sum of iid random variables. Thus, 
the classic CLT applies and we only need to verify the rate ( p4|) and the 
asymptotic expressions for the bias (p8|) and variance (|29|) . We have, 




E[ 9n (z)] = ^\±jK(^) md x 3 (30) 



= Jjjj K(y)f(z + fiy)^dy (31) 
= j^jT {/(*) + ^ V /W ■ 2/ + ^V T H{z)y + o(/i 2 )| £#32) 
= f{z) + ^jy T H{z)yK{y)dy + o{f) (33) 

where we have changed the variables of integration to get (|3l|), used assump- 
tion [1] and Taylor's theorem to get (j3~2"|) and used assumption ^ to obtain 
(|33|). For the variance we have, 



vax(g n (z)) 



nf3 2 n 2 P 
1 



var *)//*)) 

f(z + fiy)K 2 (y)^dy 



(34) 



n(3 2 [i 2 P \J\\y\\<i 



4 SOME CLASSIC ASYMPTOTIC RESULTS 



11 



J^fiz + wWyWdyj | (35) 
f K\y)dy + o(—\ (36) 



where we have used var(i^) = EK 2 — (EK) 2 and changed the variables of 
integration to get ([35|), used assumption [1] and (Oth order) Taylor's theorem 
to get (|36|) . Hence, the theorem follows from (|33"D and (^) after noticing 
that ( P^D and k = ny? imply, 

I — o 4+p 9 / 4 p+4 p+4 

Vfc/T = k 2 p n~ z/p = (n p+*k) 2 p — > a 2 ? (37) 
— = 7 = 1 (38) 

Q.E.D. 

Theorem 5 (CLT for double smoothers) Consider the estimator f n (z) 
defined in (fj^). Under assumptions 0, ||, and fl^l j we let Z n = vk(f n (z) — 



f(z)) we have ( 25 ) where now, 



P+4 



B(z) = ( 2w a f ( P z)]2/p ) P' 1 \ J MM x T H(z)x [K(x) + A ] dx | (39) 

and, 



V(z) = f(z) jr 1 J M<i K\x)dx - A x | 



(40) 



with, 



o 2 Ip /•! 2 

A = / M 1+ ^^(M)rfM- 1 (41) 



c Jo 



/■i r 



Ax = l-c-%- 1 |y cu(x)rfa;| rfy (42) 

Proof: See || theorem3.1]. Remember to substitute X by (3~ 1 K since in 
the reference the Kernels are probability densities and in here we take them 
as weight functions that integrate to (3. 
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4.1 Asymptotic Mean Square Errors 

Let f n be an arbitrary density estimator and let Z n = yk{f n {z) — f (z)) . Now 
suppose that f n (z) is asymptotically normal, in the sense that when k = k(n) 
satisfies fl2"4|) for some a > 0, we have (^J) true. Then, all the moments of 
Z n will converge to the moments of the asymptotic Gaussian. In particular 
the mean and the variance of Z n will approach B(z) and V(z) respectively. 
Using, (0) and (|24D we can write, 



lim n^E\f n {z) - /(z)| 2 = ^ + *M (43) 

We call the right hand side of ( f43"D the asymptotic mean square error (AMSE) 
of the estimator f n (z). The value of a can be optimized to obtain a global 
minimum for the (AMSE) but it is well known in nonparametrics that the 
rate n~ 4 ^ p+ ^ is best possible (in a minimax sense) under the smoothness 
asumption |T] (see e.g. J2DJ). We can take care of the knn, the kernel, and the 
double smoothing estimators simultaneously by noticing that in all cases, 

(AMSE) = a^- 1 + a 2 a 4/p (44) 
has a global minimum of, 



4 



(AMSE)* = | (1 + 4/p) (f) P+A | «r ai + ' (45) 



/ pai \ p+4 



achieved at, 



= (46) 

Replacing the corresponding values for a\ and a 2 for the knn, for the kernel, 
and for the double smoothing estimators, we obtain that in all cases, 

(AMSE)* = (const, indep. of /) { f(z) [ * } (47) 

where, 



A = f x T H(z)xG(x)dx (48) 

J\\x\\<l 
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dzf 



(49) 



with G[x) = 1 for the knn, G(x) = K(x) for the kernel, G(x) = K(x) + \q for 
the double smoothers (see ([!9|) and (f4T|)) and, if e,,- denotes the jth canonical 
basis vector (all zeroes except a 1 at position j), 



Pi 



x\\<\ 



(x ■ ej) 2 G(x)dx 



(50) 



Notice that (||) follows from (glf), (|3|) and the fact that H(z) is the Hes- 
sian of / at z. The generality of this result shows that fl47|) is typical for 
density estimation. Thus, when f n is either the knn, the kernel, or one of 
the estimators in (||, theorem3.1]), we have: 



lim n A ^E\f n (z)-f(z)\ 2 >cf(z) 




(51) 



The positive constant c may depend on the particular estimator but it is 
independent of /. Dividing both sides of (|5T|) by f(z), integrating over z, 
using theorem and interchanging E and / we obtain, 



Jim n A/i - p+A) E J \ij) n {z) - tp{z)\ 2 dz > 4c J 





A 


I 


V>(z) 



_2p_ 

P+i 



dz 



(52) 



The worst case scenario is obtained by the model f = tp 2 that maximizes the 
action given by the right hand side of (f 



C 



1 



ip{z] 



d 2 jj 2 
~dzf 



_2p_ 

P+i 



dz 



(53) 



This is a hard variational problem. However, it is worth noticing that the 
simplest case is obtained when the exponent is 1, i.e. when the dimension 
of the data space is p = 4. Assuming we were able to find a solution, this 
solution would still depend on the p parameters pi, ■ ■ ■ , p p . A choice of p/s 
is equivalent to the choice of a global metric for the data space. Notice also, 
that the exponent becomes 2 for p = oo and that for p > 3 (but not for 
p = 1 or 2) there is the possibility of non trivial (i.e. different from uniform) 
super-efficient models for which estimation can be done at rates higher than 



n 



-4/(p+4) 



These super-efficient models are characterized as the non negative 
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solutions of the Laplace equation in the metric generated by the p/s, i.e., 
non negative (f(z) > 0) solutions of, 

P Q2f 

EPCT(-) = (54) 

Recall that there are no non trivial (different from constant) non negative 
super-harmonic functions in dimensions one or two but there are plenty of 
solutions in dimension three and higher. For example the Newtonian poten- 
tials, 



with the norm, 



/(*) = c\\z\\;<*~*> (55) 



l* = £Hy ( 56 ) 



3=1 



will do, provided the data space is compact. The existence of (hand picked) 
super-efficient models is what made necessary to consider best rates only in 
the minimax sense. Even though we can estimate a Newtonian potential 
model at faster than usual nonparametric rates, in any neighborhood of the 
Newtonian model the worst case scenario is at best estimated at rate n -4// ( p+4 ) 
under second order smoothness conditions. 



5 Choosing the Optimal Norm 

All finite (p < oo) dimensional Banach spaces are isomorphic (as Banach 
spaces) to H p with the euclidian norm. This means, among other things, 
that in finite dimensional vector spaces all norms generate the same topology. 
Hence, the spheres {x G IR P : ||x|| < r} are Borelians so they are Lebesgue 
measurable and thus, estimators like the knn @ are well defined for arbitrary 
norms. It is possible, in principle, to consider norms that are not coming from 
inner products but in this paper we look only at Hilbert norms || • ||a of the 
form, 

\\z\\ 2 A = z T A T Az (57) 

where A G A with A defined as the open set of real non-singular p x p 
matrices. For each A G A define the unit sphere, 
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S A = {x G H p : x T A T Ax < 1} (58) 

its volume, 

Pa = X(3 A ) = I X(dx) (59) 
JSa 

and the A-symmetric (i.e. || • ||^ radially symmetric) kernel, K A , 

K A (x) = (K o A) (x) = K(Ax) (60) 

where K satisfies assumption |] and it is /-symmetric, i.e., radially symmetric 
in the euclidian norm. This means that K(y) depends on y only through the 
euclidian length of y, i.e. there exists a function F such that, 

K(y) = F(y T y) (61) 

The following simple theorem shows that all A-symmetric functions are really 
of the form flS0f). 

Theorem 6 For any A G A, K is Asymmetric if and only if we can write 

K(x) = K(Ax) for all x G 1R P (62) 
for some I -symmetric K . 

Proof: K(x) is A-symmetric iff K(x) = F(\\x\\ A ) for some function F. 

Choose K(x) = K(A~ 1 x). This K is /-symmetric since K(x) = F {[[AA^x) 7 \AA~ 1 x)^J = 

F(x T x). More over, K(x) = K{A^ 1 {Ax)) = K(Ax). Thus, ( p^ ) is necessary 
for A-symmetry. It is also obviously sufficient since the assumed /-symmetry 
of K in implies that K(x) = F((Ax) T (Ax)) = F(\\x\\ 2 A ) so it is A- 
symmetric. 
Q.E.D. 

An important corollary of theorem ^| is, 

Theorem 7 Let A, B G A. Then, K is AB-symmetric if and only if Kb-\ 
is A-symmetric. 
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Proof: By the first part of theorem || we have that K = K o A o B with K 
some /-symmetric. Thus, K o B^ 1 = K o A is A-symmetric by the second 
part of theorem |6|. 
Q.E.D. 

Let us denote by /3(A, K) the total volume that a kernel K assigns to the 
unit A-sphere Sa, i.e., 



/3(A, 1C) = / K(x)dx (63) 

In order to understand the effect of changing the metric on a density esti- 
mator like the kernel (0), it is convenient to write g n explicitly as a function 
of everything it depends on; The point z, the metric A, the A-symmetric 
kernel function K, the positive smoothness parameter /i and, the data set 
{xi, . . . , x n }. Hence, we define, 

g n (z-A,K,fi,{x l ,...,x n })= n j=1 - V — '- (64) 

The following result shows that kernel estimation with metric AB is equiv- 
alent to estimation of a transformed problem with metric A. The explicit 
form of the transformed problem indicates that a perturbation of the met- 
ric should be regarded as composed of two parts: Shape and volume. The 
shape is confounded with the form of the kernel and the change of volume is 
equivalent to a change of the smoothness parameter. 

Theorem 8 Let A, B £ A, \i > 0, and K an AB -symmetric kernel. Then, 
for all z G 1R P and all data sets {xi . . . , x n } we have, 

g n (z; AB, K, (i, {x x , x n }) = g n (Bz; A, KoB' 1 , |fi|" 1/p /i, {Bx x , Bx n }) 

(65) 

where \B\ denotes the determinant of B and B = \B\ l f p B is the matrix B 
re-scaled to have unit determinant. 

Proof: To simplify the notation let us denote, 

Substituting AB for A in ( |6"4"D and using theorem |^ we can write the left 
hand side of (1651) as, 
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P(AB,K)fiP 



l -TJUKoA) 



P(A,KoA)(ji B y 



where K is some /-symmetric kernel and we have made the change of vari- 
ables x = B~ x y in j3(AB,K) (see fl6"3|) ). The last expression is the right 
hand side of (|65|). Notice that, K o A = K B -i is A-symmetric. 
Q.E.D. 



5.1 Generalized knn Case with Uniform Kernel 

In this section we find the norm of type Q57D that minimizes ([171) for the 
estimators of the knn type with uniform kernel which include the double 
smoothers with K{x) — 1. As it is shown in theorem ^ a change in the 
determinant of the matrix that defines the norm is equivalent to changing 
the smoothness parameter. The quantity (^) to be minimized was obtained 
by fixing the value of the smoothness parameter to the best possible, i.e. the 
one that minimizes the expression of the (AMSE) (f43f ). Thus, to search for 
the best norm at a fix value of the smoothness parameter we need to keep 
the determinant of the matrix that defines the norm constant. We have, 

Theorem 9 Consider the problem, 

min ( / x T H[z)xdx ) (67) 

where the minimum is taken over all p x p matrices with determinant one, 
Sa is the unit A-ball and H(z) is the Hessian of the density f G C 2 at z 
which is assumed to be nonsingular. 

When H(z) is indefinite, i.e. when H(z) has both positive and negative 
eigenvalues, the objective function in achieves its absolute minimum 
value of zero when A is taken as, 



A = c^diagJ-^, . . . , J Jo-, J^±±, . . . , (68) 




where the £j are the absolute value of the eigenvalues of H(z), m is the 
number of positive eigenvalues, M is the matrix of eigenvectors and c is a 
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normalization constant to get detA = 1 (see the proof for more detailed 
definitions). 

If H(z) is definite, i.e. when H(z) is either positive or negative definite, 
then for all p x p real matrices A with det A = 1 we have, 

> ^-^ p\detH(z)\ 1/p (69) 
p + 3 



/ x T H(z)xdx 

J Sa 



IS A 

with equality if and only if, 

\H 1 J 2 (z)\ 1 Ip 



1 /2 

where H+ (z) denotes the positive definite square-root of H{z) (see the proof 
below for explicit definitions). 

Proof: Since / G C 2 the Hessian is a real symmetric matrix and we can 
therefore find an orthogonal matrix M that diagonalizes H(z), i.e. such that, 

H(z) = M T DM with M T M = I (71) 

where, 

D = diag (£i, 6, • • • , Cm, -£m+i, • • • , -f P ) (72) 

where all the £j > and we have assumed that the columns of M have been 
ordered so that all the m positive eigenvalues appear first and all the negative 
eigenvalues — £ m +i, . . . , — £ p appear last so that ([71]) agrees with Q72]). Split 
D as, 



D = diag(fi, . . . ,Cm,0, . . . ,0) - diag(0, . . . ,0,f m+ i, . . . ,Q 

= D + -D_ (73) 

and use (1711) and (T73|) to write, 



//(:) = M' I). M - M' I) M 

, T 



[d^m) (d 1 J 2 m) - (d^m) (d 1 1 2 m) 

= - S T S_ (74) 

Using ([74] ) and the substitution y = Ax we obtain, 

/ x T H(z)xdx = f y T fy4 _1 ) T (sTS + — £ T X_) A~ 1 ydy 
Js A Jy T y<i v J v 7 

= / {EA-^EA-tydy (75) 
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where, 

E = £+ + £_ = {D + + D_) 1,2 M (76) 
and < ., . > denotes the pseudo-Riemannian inner product, 

m P 

( x , y) = Yl %l yi ~ Z ( 77 ) 

i=l i=m+l 

By letting G = diag(l, . . . , 1, —1, . . . , —1) (i.e. m ones followed by p — m 
negative ones) be the metric with the signature of H(z) we can also write 

(x,y)=x T Gy (78) 

Let, 

5= [6i|6 2 |...|6 p ] = EA~ 1 (79) 

where bi, . . . ,b p denote the columns of B. Substituting ( fT9"D into (|75|), using 
the linearity of the inner product and the symmetry of the unit euclidian 
sphere we obtain, 



where p stands for, 



/ x T H(z)xdx = / (By, By) dy 
Js A Jy T y<i 

= ££<M*> / y j y k dy (so) 
j k Js ' 

j k 

= p£(Mi> (81) 

3=1 

/S/ v 7 p + 3 

From fl79|) and (^l|) it follows that problem (|67|) is equivalent to, 

^(l^Mi)) ( 83 ) 

When -ff(^) is indefinite, i.e. when m ^ {0,p} it is possible to choose the 
columns of B so that J2j (bj,bj) = achieving the global minimum. There 
are many possible choices but the simplest one is, 

B = c • diag( v / P — m, VP ~ m i ■ ■ ■ > VP ~ m i V™, V™, ■ ■ ■ , V 7 "^) (84) 

V v ' v ' 

m p—m 
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since, 



(bj, bj) = c 2 m(^p — m) 2 — c 2 (p — m)(v / m) 2 = 0. (85) 

3=1 

The scalar constant c needs to be adjusted to satisfy the constraint that 
det-B = detS. From (|79|), and (76) we obtain that the matrix for the 
optimal metric can be written as, 



A = B~ X T, 



y/p — m 



From (|86|) we get, 



(86) 



A A 



p — m 

Finally from (|74D we can rewrite (j87|) as, 



m 



(87) 



A A 



c~ 2 M T 



c~ 2 M T diag( 



p — m m 



6 



M 

sm+1 



p — m' ' p — m' m 



...,^)M 

m 



(88) 



(89) 



Notice that when p — m = m (i.e. when the number of positive equals the 
number of negative eigenvalues of H(z)) the factor 1/m can be factorized 
out from the diagonal matrix in ( |89"D and in this case the optimal A can be 

expressed as, 



.4 



Hi /2 (z) 



\Hl /2 (z)\ l /P 

where we have used the positive square-root of H(z) defined as, 



Hi /2 (z) 



diag(V6 



(90) 



(91) 



In all the other cases for which H(z) is indefinite, i.e. when m {0,p/2,p} 
we have, 



A = c 1 diae 



It 



p — m 



p — m 



m+l 

m 



m 



(92) 
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The normalization constant c is fixed by the constraint that det A = 1 as, 

m (p— m) 1 

c= (p — m)~2pm ~\ det H(z)\*p (93) 

This shows 

Let us now consider the only other remaining case when H(z) is definite, 
i.e. either positive definite (m = p) or negative definite (m = 0). Introducing 
Ao as the Lagrange multiplier associated to the constraint det B = det £ we 
obtain that the problem to be solved is, 

min £(&!, b 2 , . . . , b p , A ) (94) 

6i,...bp,Ao 

where the Lagrangian L is written as a function of the columns of B as, 
C(b u 6a, ... , 6 P , A ) = ^Tj ft,-) j - 4A (det(6 1 , . . . , b p ) - det E) (95) 

The — 4Ao instead of just Ao is chosen to simplify the optimality equations 
below. The optimality conditions are, 

dC dC 

— = for j = l,...,p and ^ = ( 96 ) 

where the functional partial derivatives are taken in the Frechet sense with 
respect to the column vectors by The Frechet derivatives of quadratic and 
multi linear forms are standard text-book exercises. Writing the derivatives 
as linear functions of the vector parameter h we have, 

^(bj,b 3 )(h) = 2 <&,-,/») (97) 

jj-det(&i,...,& p )(/0 = det(6i,..., Jt^ ,...,6p) (98) 
J j-th col. 

Thus, using ( |9^) and (p8|) to compute the derivative of (p5|) we obtain that 
for all h and all j — 1, . . . , p we must have, 

— (/i) = 2 |£ (6 fc , 6 fe }| 2 ft) - 4A det(6 ls . . . , h, . . . , b p ) = (99) 
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When Y,k < bk,bk >^ we can rewrite (|99|) as, 

h) = Cq 1 det^h, . . . , h, ...,b p ) 



J 3i 



(100) 



But now we can substitute h = bi with i ^ j into ( |100| ) and use the fact that 
the determinant of a matrix with two equal columns is zero, to obtain, 



(bj, bi) = for all i ^ j. 
In a similar way, replacing h = bj into ( |100| ), we get 

{bj, bj) = Cq 1 det B = c 



(101) 



(102) 



where c is a constant that needs to be fixed in order to satisfy the constraint 
that det B = det S. We have shown that the optimal matrix B must have or- 
thogonal columns of the same length for the G-metric. This can be expressed 
with a single matrix equation as, 



B T GB = cl 

Substituting ([T^) into ( |103| ) and re-arranging terms we obtain, 



A T A = c- 1 E T GE 



A 1 A 



c- 1 (S^ + Si)G(S + + S_ 
c" 1 (S^S + - S T S_) 
c~ x H{z) 



From ([TOD, (0) and (]8lTj we obtain 



x T H(z)x dx 



s A 



> 



pp\c\ 



(103) 
(104) 

(105) 
(106) 



and replacing the values of p and c we obtain (^9]) . 
Q.E.D. 

5.2 Yet Another Proof When The Hessian is Definite 

Consider the following lemma. 

Lemma 1 Let A, B be two px p non-singular matrices with the same deter- 
minant. Then 

IvWldy (107) 



Sa 



I *^ 1 1 



Sb 
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Proof: Just split Sa and Sb as, 



S A = (S A S B )U(S A S B ) (108) 
S B = (S B S A )U(S B S C A ) (109) 



and write, 



||x||o<ix= / 11x11 %dx — / \x^ B dx -\- / ||a;|||(ia; (HO) 

S A JSb JS a S b JS a S b 



Now clearly, 



min \\x\\ 2 B > 1 > max \\y\\ B (m) 

x&S A S c B yeS c A S B 



from where it follows that, 



\\x\\ B dx > min \\x\\ 2 B / dx (112) 

s A s c B xes A s B Js A s% 

> max \\y\\% dy (113) 

yeS c A S B Js A s B 

> / \\y\\%dy (H4) 

Js A s B 

where (|112 ) and (|114j) follow from To justify the middle inequality 

(|113|) notice that from (|108|) , ( |109| ) and the hypothesis that \A\ = \B\ we can 
write, 

/ dx + dx = dy + dy (H5) 

Js A s B Js A s B Js A s B Js a s b 

The conclusion ( |107p follows from inequality (|114 ) since that makes the last 

two terms in ( |110| ) non-negative. 

Q.E.D. 

If B is a nonsingular matrix we define, 



* = d^ (116) 



An immediate consequence of lemma |T| is, 

Theorem 10 IfH(z) is definite, then for allpxp matrices with \ A\ = 1 we 
have, 

|2 j„ ^ / |i ||2 



A = / ||ar|| Hl/aw da; > / M\ H ^{z) dx ( 117 ) 
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Proof: 

A = \H(z)\ 1/p [ \\x\\% 1/2 dx (118) 

Sa 

> \H(z)\ 1/p f \\x\\% 1/2 , z) dx (119) 

= I \\x\\m/ Hz) dx (120) 

where we have used lemma [I] to deduce the middle inequality ( |119j ) . 
Q.E.D. 

5.3 Best Norm With General Kernels 

In this section we solve the problem of finding the optimal norm in the general 
class of estimators (||). 

Before we optimize the norm we need to state explicitly what it means to 
do estimation with different kernels and different norms. First of all a general 
kernel function is a nonnegative bounded function defined on the unit sphere 
generated by a given norm. Hence, the kernel only makes sense relative to 
the given norm. To indicate this dependence on the norm we write Ka for 
the kernel associated to the norm generated by the matrix A. We let 

K A = KoA (121) 

where K = Kj is a fix mother kernel defined on the euclidian unit sphere. 
Equation ( |121| ) provides meaning to the notion of changing the norm without 



changing the kernel. What this means is not that the kernel is invariant under 
changes of A but rather equivariant in the form specified by ( |121| ). Recall also 



that a proper kernel must satisfy (|22|) . To control bias we must also require 



the kernels to satisfy (^). It is easy to see (just change the variables) that 
if the mother kernel K has these properties so do all its children with 
the only proviso that \A\ = 1 in order to get (|22|). Notice also that radial 
symmetry of K is a sufficient but not a necessary condition for (p3|). 

The optimization of the norm with general kernels looks more complicated 
than when the kernel is uniform since the best (AMSE)* also depends on 
Is A K\{x)dx. Consider the double smoothing estimators, which are the most 
general case treated in this paper. From, (^), (f5]) and (|45|) we have, 



(AMSE)* = (const.) ^ J^K 2 A (x)dx - f(z) (4rx)' + * i 122 ) 
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where the constant depends only on the dimension of the space. Even though 
the dependence of (|122|) on A looks much more complicated than (|47|) this is 
only apparently so. In fact the two expressions define very similar optimiza- 
tion problems as we now show. 

First notice that the search for best A must be done within the class 
of matrices with a fix determinant. For otherwise we will be changing the 
value of the smoothness parameter that was fixed to the best possible value 
in order to obtain ( |122|) . If we let \A\ = 1 we have, 



f K{x)dx = f3= f dx = A(5j) (123) 

JSa JSa 

We also have that, 

/ K\{y)dy=f K 2 (Ay)dy= [ K 2 (y)dx (124) 

JSa JSa J St 



From ( p.23| ) and (124) we deduce that the term in ( |122j ) within cursive brack- 
ets is the same for all matrices A and it depends only on the fix kernel K. 
Finally notice that the value of A in ( |122| ) is given by 

A= f x T H(z)xG{Ax)dx (125) 

JSa 

where G(x) = K(x) + A in the general case. By retracing again the steps 
that led to (|8lD we can write, 

A = ££<M*> / y j y k G(y)d y (126) 

3 k JSl 

j k 

= E<Mi>Pi(G) ( 127 ) 
3=1 



Pj (G) = j (x' J ) 2 G(x)dx (128) 



where now, 

Pi(G) = 

I S x 

There are three cases to be considered. 

1. All the Pj(G) = p for j — 1, . . . , p. The optimization problem reduces 
to the case when the kernel is uniform and therefore it has the same 
solution. 
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2. All the pj(G) have the same sign, i.e. they are all positive or all nega- 
tive. If e.g. all pj > just replace bj with y/p]bj and use the formulas 
obtained for the uniform kernel case. 

3. Some of the Pj(G) are positive and some are negative. This case can be 
handled like the previous one after taking care of the signs for different 
indices j. 

The first case is the most important for it is the one implied when the kernels 
are radially symmetric. The other two cases are only included for complete- 
ness. Clearly if we do estimation with a non radially symmetric kernel the 
optimal norm would have to correct for this arbitrary builtin asymmetry, 
effectively achieving at the end the same performance as when radially sym- 
metric kernels are used. The following theorem enunciates the main result. 



Theorem 11 In the general class of estimators (^) with radially symmetric 
(mother) kernels, best possible asymptotic performance (under second order 
smoothness conditions) is achieved when distances are measured with the best 
metrics obtained when the kernel is uniform. 



6 Asymptotic Relative Efficiencies 

The practical advantage of using density estimators that adapt to the form 
of the optimal metrics can be measured by computing the Asymptotic Rel- 
ative Efficiency (ARE) of the optimal metric to the euclidian metric. Let us 
denote by AMSE(I) and AMSE(H(z)) the expressions obtained from (|122| ) 



when using the euclidian norm and the optimal norm respectively. For the 
Euclidean norm we get, 



*(ptr H(z)) 



V 
p+4 



AMSE(I) = (const.) (/T 1 / K 2 (x)dx - XxV^ f(z) . , 

I J Si J V J\ z ) J 

(129) 

where tr stands for the trace since, 

A = / x T H(z)xG(x)dx = Yh ii (z) f x i x i G(x)dx = ptr H(z) (130) 
Jsi t7 hi 
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Using ( |123| ), ( 124 ) and (|69| ) we obtain that when H(z) is definite, 

AMSE(H(z)) = (131) 

(const.) |/3 ^ K 2 {x)dx-X 1 j f(z) — — ^ — - 1 

Hence, when #(2) is definite the v4.RE is, 

Mg£(J) / tr g(s) \ & 
AMSE(H(z)) [pldetHWJ 1 ' 

If £1, . . . ,£ p are the absolute value of the eigenvalues of H(z) then we can 
write, 

AflB= |^i^ / arith.meauoffen ^ 
^11 C') / \geom. mean of {qj} J 



It can be easily shown that the arithmetic mean is always greater or equal 
than the geometric mean (take logs, use the strict concavity of the logarithm 
and Jensen's inequality) with equality if and only if all the £j's are equal. 
Thus, it follows from ( |133| ) that the only case in which the use of the optimal 
metric will not increase the efficiency of the estimation of the density at 
a point where the Hessian is definite is when all the eigenvalues of H(z) 
are equal. It is also worth noticing that the efficiency increases with p, the 
dimension of the data space. There is of course infinite relative efficiency in 
the regions where the H(z) is indefinite. 



7 An Example: Radially Symmetric Distri- 
butions 

When the true density f(z) has radial symmetry it is possible to find the 
regions where the Hessian H(z) is positive and negative definite. These 
models have horizons defined by the boundary between the regions where 
H(z) is definite. We show also that when and only when the density is linear 
in the radius of symmetry, the Hessian is singular in the interior of a solid 
sphere. Thus, at the interior of these spheres it is impossible to do estimation 
with the best metric. 
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Let us denote simply by L the log likelihood, i.e., 

f(z) = exp(L) (134) 

If we also denote simply by Lj the partial derivative of L with respect to Zj 
then, 

J£ = /(*)^ ( 135 ) 
<hJ |^ Lj + f(z) Lij = f(z) {LiLj + Ly} (136) 



and also, 



dzidzj dz, 



where we have used ( |135| ) and the definition L^ = It is worth notic- 
ing, by passing, that (|136|) implies a notable connection with the so called 
nonparametric Fisher information 1(f) matrix, 

J H(z)dz = X(f)-l{f) = (137) 

our main interest here however, is the computation of the Hessian when the 
density is radially symmetric. Radial symmetry about a fix point p. G IR P is 
obtained when L (and thus / as well) depends on z only through the norm 
|| z — /i||y-i for some symmetric positive definite p x p matrix V. Therefore 
we assume that, 

L = L(- l -{z-^ T V-\z-ri) (138) 

from where we obtain, 



Li = (-v % \z-n))L' (139) 
L^ = L"/(z- p)v j \z- //)- L'v ij (140) 

where v % ' and denote the i-th row and ij-ih entries of V~ x respectively. 
Replacing ( |139[ ) and (140) into ( J136[ ) , using the fact that V" 1 is symmetric 
and that v^'(z-fi) is a scalar and thus, equal to its own transpose (z — /i) T f ' J , 
we obtain 

H(z) = f(z)L' | U' + V-\z -n){z- /i) T - /} V- 1 (141) 



We have also assumed that L' is never zero. With the help of (|141|) we can 
now find the conditions for H(z) to be definite and singular. Clearly H(z) 
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will be singular when the determinant of the matrix within curly brackets 
in ( |141| ) is zero. But that determinant being zero means that A = 1 is an 
eigenvalue of 

(L' + L"/L')V-\z-v)(z-ii) T (142) 

and since this last matrix has rank one its only nonzero eigenvalue must be 
equal to its own trace. Using the cyclical property of the trace and letting 

we can write, 

Theorem 12 The Hessian of a radially symmetric density is singular when 
and only when either V = or 

L ' + ^ L '-T V < 143 > 

Notice that theorem [12| provides an equation in y after replacing a par- 
ticular function L = L(y). Theorem can also be used to find the functions 
L(y) that will make the Hessian singular. Integrating ( |143| ) we obtain, 



L(y)+ log L'( y ) = -~log(\y\) + c (144) 

and solving for L', separating the variables and integrating we get, 

L(y) = log (ay/[y\ + bj (145) 

where a and b are constants of integration. In terms of the density equation 
(|145|) translates to, 

f(z) = a\\z - fi\\ v -i +b (146) 

Hence, in the regions where the density is a straight line as a function of 
r = \\z — /u||y-i the Hessian is singular and estimation with best metrics is 



not possible. Moreover, from ( |141| ) we can also obtain the regions of space 
where the Hessian is positive and where it is negative definite. When L' > 0, 
H(z) will be negative definite provided that the matrix, 



I — (L' + L"/L')V-\z -fj)(z- fxf (147) 
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is positive definite. But a matrix is positive definite when and only when all 
its eigenvalues are positive. It is immediate to verify that £ is an eigenvalue 
for the matrix ( |147| ) if and only if (1 — £) is an eigenvalue of the matrix (|142 ) 



The matrix ( |142|) has rank one and therefore its only nonzero eigenvalue is 
its trace so we arrive to, 

Theorem 13 When, 

L ' + i_l 0gL '<_J_ (148) 
dy 2y 

H(z) is negative definite when L' > and positive definite when V < 0. 
When, 

L' + ± l og L'>~ (149) 
dy 2y 

H(z) is indefinite. 

For example when f(z) is multivariate Gaussian L(y) = y + c so that 
L' = 1 and the horizon is the surface of the V A_1 -sphere of radius one i.e., 
(z — nYV' 1 ^ — fi) = 1. Inside this sphere the Hessian is negative definite 
and outside the sphere the Hessian is indefinite. The results in this section 
can be applied to any other class of radially symmetric distributions, e.g. 
multivariate T which includes the Cauchy. 



8 Conclusions 

We have shown the existence of optimal metrics in nonparametric density 
estimation. The metrics are generated by the Hessian of the underlying 
density and they are unique in the regions where the Hessian is definite. The 
optimal metric can be expressed as a continuous function of the Hessian in 
the regions where it is indefinite. The Hessian varies continuously from point 
to point thus, associated to the general class of density estimators @ there is 
a Riemannian manifold with the property that if the estimators are computed 
based on its metric the best asymptotic mean square error is minimized. The 
results are sufficiently general to show that these are absolute bounds on the 
quality of statistical inference from data. 

The similarities with General Relativity are evident but so are the differ- 
ences. For example, since the Hessian of the underlying density is negative 
definite at local maxima, it follows that there will be a horizon boundary 



9 ACKNOWLEDGMENTS 



31 



where the Hessian becomes singular. The cross of the boundary corresponds 
to a change of signature in the metric. These horizons almost always are null 
sets and therefore irrelevant from a probabilistic point of view. However, 
when the density is radially symmetric changing linearly with the radius we 
get solid spots of singularity. There is a qualitative change in the quality of 
inference that can be achieved within these dark spots. But unlike GR, not 
only around local maxima but also around local minima of the density we 
find horizons. Besides, it is not necessary for the density to reach a certain 
threshold for these horizons to appear. Nevertheless, I believe that the in- 
fusion of new statistical ideas into the foundations of Physics, specially at 
this point in history, should be embraced with optimism. Only new data will 
(help to) tell. 

There are many unexplored promising avenues along the lines of the sub- 
ject of this paper but one that is obvious from a GR point of view. What is 
missing is the connection between curvature and probability density, i.e. the 
field equation. I hope to be able to work on this in the near future. 

The existence of optimal metrics in density estimation is not only of the- 
oretical importance but of significant practical value as well. By estimating 
the Hessian (e.g. with kernels that can take positive and negative values, see 
|2l| ) we can build estimators that adapt to the form of the optimal norm with 
efficiency gains that increase with the number of dimensions. The antidote 
to the curse of dimensionality! 
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